Search CORE

841 research outputs found

clValid: An R Package for Cluster Validation

Author: Guy Brock
Somnath Datta
Susmita Datta
Vasyl Pihur
Publication venue
Publication date
Field of study

The R package clValid contains functions for validating the results of a clustering analysis. There are three main types of cluster validation measures available, "internal", "stability", and "biological". The user can choose from nine clustering algorithms in existing R packages, including hierarchical, K-means, self-organizing maps (SOM), and model-based clustering. In addition, we provide a function to perform the self-organizing tree algorithm (SOTA) method of clustering. Any combination of validation measures and clustering methods can be requested in a single function call. This allows the user to simultaneously evaluate several clustering algorithms while varying the number of clusters, to help determine the most appropriate method and number of clusters for the dataset of interest. Additionally, the package can automatically make use of the biological information contained in the Gene Ontology (GO) database to calculate the biological validation measures, via the annotation packages available in Bioconductor. The function returns an object of S4 class "clValid", which has summary, plot, print, and additional methods which allow the user to display the optimal validation scores and extract clustering results.

Research Papers in Economics

Computational biology touches all bases

Author: Datta Somnath
Datta Susmita
Publication venue: BioMed Central
Publication date: 01/01/2009
Field of study

A report of the 6th Annual Rocky Mountain Bioinformatics Conference, Aspen, USA, 4-7 December 2008

Crossref

PubMed Central

RankAggreg, an R package for weighted rank aggregation

Author: Datta Somnath
Datta Susmita
Pihur Vasyl
Publication venue: BioMed Central
Publication date: 01/01/2009
Field of study

Abstract Background Researchers in the field of bioinformatics often face a challenge of combining several ordered lists in a proper and efficient manner. Rank aggregation techniques offer a general and flexible framework that allows one to objectively perform the necessary aggregation. With the rapid growth of high-throughput genomic and proteomic studies, the potential utility of rank aggregation in the context of meta-analysis becomes even more apparent. One of the major strengths of rank-based aggregation is the ability to combine lists coming from different sources and platforms, for example different microarray chips, which may or may not be directly comparable otherwise. Results The <it>RankAggreg </it>package provides two methods for combining the ordered lists: the Cross-Entropy method and the Genetic Algorithm. Two examples of rank aggregation using the package are given in the manuscript: one in the context of clustering based on gene expression, and the other one in the context of meta-analysis of prostate cancer microarray experiments. Conclusion The two examples described in the manuscript clearly show the utility of the <it>RankAggreg </it>package in the current bioinformatics context where ordered lists are routinely produced as a result of modern high-throughput technologies.</p

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

SeqNet: An R Package for Generating Gene-Gene Networks and Simulating RNA-Seq Data

Author: Datta Somnath
Grimes Tyler
Publication venue: 'Foundation for Open Access Statistic'
Publication date: 01/01/2021
Field of study

Gene expression data provide an abundant resource for inferring connections in gene regulatory networks. While methodologies developed for this task have shown success, a challenge remains in comparing the performance among methods. Gold-standard datasets are scarce and limited in use. And while tools for simulating expression data are available, they are not designed to resemble the data obtained from RNA-seq experiments. SeqNet is an R package that provides tools for generating a rich variety of gene network structures and simulating RNA-seq data from them. This produces in silico RNA-seq data for benchmarking and assessing gene network inference methods. The package is available from the Comprehensive R Archive Network at https://CRAN.R-project.org/package= SeqNet and on GitHub at https://github.com/tgrimes/SeqNet

PubMed Central

Journal of Statistical Software

UNF Digital Commons

Differential Co-Abundance Network Analyses for Microbiome Data Adjusted for Clinical Covariates Using Jackknife Pseudo-Values

Author: Ahn Seungjun
Datta Somnath
Publication venue
Publication date: 23/03/2023
Field of study

A recent breakthrough in differential network (DN) analysis of microbiome data has been realized with the advent of next-generation sequencing technologies. The DN analysis disentangles the microbial co-abundance among taxa by comparing the network properties between two or more graphs under different biological conditions. However, the existing methods to the DN analysis for microbiome data do not adjust for other clinical differences between subjects. We propose a Statistical Approach via Pseudo-value Information and Estimation for Differential Network Analysis (SOHPIE-DNA) that incorporates additional covariates such as continuous age and categorical BMI. SOHPIE-DNA is a regression technique adopting jackknife pseudo-values that can be implemented readily for the analysis. We demonstrate through simulations that SOHPIE-DNA consistently reaches higher recall and F1-score, while maintaining similar precision and accuracy to existing methods (NetCoMi and MDiNE). Lastly, we apply SOHPIE-DNA on two real datasets from the American Gut Project and the Diet Exchange Study to showcase the utility. The analysis of the Diet Exchange Study is to showcase that SOHPIE-DNA can also be used to incorporate the temporal change of connectivity of taxa with the inclusion of additional covariates. As a result, our method has found taxa that are related to the prevention of intestinal inflammation and severity of fatigue in advanced metastatic cancer patients.Comment: 23 pages, 2 figures, 4 table

arXiv.org e-Print Archive

clValid: An R Package for Cluster Validation

Author: Guy Brock
Somnath Datta
Susmita Datta
Vasyl Pihur
Publication venue: Foundation for Open Access Statistics
Publication date: 01/01/2008
Field of study

CiteSeerX

Directory of Open Access Journals

Journal of Statistical Software

Adjusting for informative cluster size in pseudo-value based regression approaches with clustered time to event data

Author: Anyaso-Samuel Samuel
Datta Somnath
Publication venue
Publication date: 23/03/2023
Field of study

Informative cluster size (ICS) arises in situations with clustered data where a latent relationship exists between the number of participants in a cluster and the outcome measures. Although this phenomenon has been sporadically reported in statistical literature for nearly two decades now, further exploration is needed in certain statistical methodologies to avoid potentially misleading inferences. For inference about population quantities without covariates, inverse cluster size reweightings are often employed to adjust for ICS. Further, to study the effect of covariates on disease progression described by a multistate model, the pseudo-value regression technique has gained popularity in time-to-event data analysis. We seek to answer the question: "How to apply pseudo-value regression to clustered time-to-event data when cluster size is informative?" ICS adjustment by the reweighting method can be performed in two steps; estimation of marginal functions of the multistate model and fitting the estimating equations based on pseudo-value responses, leading to four possible strategies. We present theoretical arguments and thorough simulation experiments to ascertain the correct strategy for adjusting for ICS. A further extension of our methodology is implemented to include informativeness induced by the intra-cluster group size. We demonstrate the methods in two real-world applications: (i) to determine predictors of tooth survival in a periodontal study, and (ii) to identify indicators of ambulatory recovery in spinal cord injury patients who participated in locomotor-training rehabilitation.Comment: 22 pages, 4 figures, 4 table

arXiv.org e-Print Archive